Escherichia coli (E. coli) bacteria can damage the DNA of gut lining cells and, according to recent reports,
may encourage the development of colon cancer. Genetic switches are specific sequence motifs, and many of them are
drug targets. It is therefore of interest to identify motifs and their locations in sequences. In the present study,
the Gibbs sampler algorithm was used to predict functional motifs in E. coli NC101 contig 1. The whole genomic
sequence of E. coli NC101 contig 1 was retrieved from http://www.ncbi.nlm.nih.gov (NCBI Reference
Sequence: NZ_AEFA01000001.1) and analyzed with DAMBE software and BLAST. The
results showed that the 6-mer motif is CUGGAA in most sequences (genes 1-3, 8, 9, 12, 14-18, 20-23, 25, 27, 29,
31-34), CUUGUA for gene 4, CUGUAA for gene 5, CUGAUG for gene 6, CUGAUA for gene 7, CUGAAA for
genes 10, 11, 13, 26 and 28, CUGGAG for gene 19, and CUGGUA for genes 24 and 30 in E. coli NC101 contig 1. It is
concluded that the 6-mer motif is CUGGAA in most sequences in E. coli NC101 contig 1. The present study may
help experimental studies elucidate the pharmacological and phylogenetic functions of the motifs in E. coli.
Infection of eukaryotic cells with pks+ E. coli strains induces host-cell DNA double-strand
breaks (DSBs) and activation of the DNA damage signaling cascade, including the
ATM-CHK-CDC25-CDK1 pathway and Ser139 phosphorylation of histone H2AX (1). A genome can
sustain the life of a cell only if its encoded genes are activated or inactivated during
molecular and cellular changes, in response to environmental factors, so that the various
RNAs and proteins are produced at the right time and place (2). The aim of motif discovery
is to find patterns in protein or nucleotide sequences in order to understand the function
and structure of the molecules the sequences represent (3). Motif extraction by multiple
local alignment (MLA) is often used to determine DNA sites that are recognized by
transcription factors (TFs). This is based on the assumption that DNA sequences upstream of
coregulated genes contain similar nucleotide subsequences (4). Genetic switches are specific
sequence motifs, and many of them are drug targets (5). They include intron branch point
sites, transcription factor binding sites, intron splicing sites, etc. The Gibbs sampler is
a Monte Carlo algorithm used to find these motifs (5). The Monte Carlo method was created by



* Corresponding author: Department of Biology, University of Zabol, Zabol, Iran. Email: reza.motaleb@uoz.ac.ir;
rezamotalleb@gmail.com



Functional Motifs in Escherichia Coli NC101 

Stanislaw Ulam and developed within nuclear weapons projects in the USA (6). The Gibbs
sampler has been used to identify functional motifs in proteins (7), for multiple sequence
alignment (8), and in biological image processing (9). The main element of a Gibbs sampler
is the position weight matrix (PWM). The PWM score (PWMS) has been reported as a measure of
motif strength (5). Escherichia coli is a facultatively anaerobic bacterium and the most
common facultative anaerobe in the human intestinal flora. E. coli colonizes the intestine
within a few days after birth and persists throughout human life. Strains of E. coli can be
categorized into four main groups (A, B1, B2, and D), and the B2 group can persist in the
colon longer than the others (10). It was reported that E. coli strains of the B2 phylotype
(e.g. E. coli NC101) carry a genomic pks island (a gene cluster coding for nonribosomal
peptide synthetases, NRPS, and polyketide synthetases, PKS) and produce colibactin, a
peptide-polyketide genotoxin that can damage DNA by inducing double-strand breaks (DSBs)
(11) and may promote the development of colon cancer (12). In the present study, the Gibbs
sampler in DAMBE software, together with BLAST, was used to identify functional motifs in
E. coli strain NC101 contig 1.



Materials and Methods 



This investigation was started in the spring of 2013, and the data analysis was performed at
the bioinformatics facility of the Faculty of Science at Zabol University. The genome
sequence of E. coli NC101 contig 1 (NCBI Reference Sequence: NZ_AEFA01000001.1) was
retrieved from http://www.ncbi.nlm.nih.gov and analyzed with DAMBE (5) to find the branch
point sequence (BPS) in E. coli NC101 contig 1.

BLAST search of the E. coli NC101 genome (accession NZ_AEFA00000000) confirmed the
presence of pks. The PWM is computed as

$$\mathrm{PWM}_{ij} = \log_2\left(\frac{p_{ij}}{p_i}\right) \qquad (1)$$

where i = 1, 2, 3 and 4 correspond to A, C, G and U, respectively, j is the site index,
$p_i$ is the background frequency of nucleotide i, and $p_{ij}$ is the site-specific
frequency of nucleotide i at site j. The PWMS for a particular motif is computed as

$$\mathrm{PWMS} = \sum_{j=1}^{L} \mathrm{PWM}_{ij} \qquad (2)$$

where L is the length of the motif and i denotes the nucleotide observed at site j of the
motif (5).
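Equations 1 and 2 can be illustrated with a minimal Python sketch. It is not DAMBE's implementation: the function names, the toy counts and the background-weighted pseudocount (used only to avoid taking the logarithm of zero) are assumptions made for illustration.

```python
import math

NUCLEOTIDES = "ACGU"  # i = 1..4 in Eq. 1


def position_weight_matrix(site_counts, background, pseudocount=1.0):
    """Return PWM[j][nucleotide] = log2(p_ij / p_i) for each site j (Eq. 1).

    site_counts: one dict per motif site, mapping nucleotide -> count.
    background:  dict mapping nucleotide -> background frequency p_i.
    pseudocount: hypothetical smoothing weight, distributed according to
                 the background so that zero counts never give log(0).
    """
    pwm = []
    for counts in site_counts:
        n = sum(counts[nuc] for nuc in NUCLEOTIDES)
        row = {}
        for nuc in NUCLEOTIDES:
            p_ij = (counts[nuc] + pseudocount * background[nuc]) / (n + pseudocount)
            row[nuc] = math.log2(p_ij / background[nuc])
        pwm.append(row)
    return pwm


def pwms(motif, pwm):
    """Eq. 2: sum the PWM entry of the observed nucleotide at each site."""
    return sum(pwm[j][nuc] for j, nuc in enumerate(motif))


# Toy data (not from the study): a 2-site motif observed in 4 sequences.
background = {nuc: 0.25 for nuc in NUCLEOTIDES}
counts = [
    {"A": 4, "C": 0, "G": 0, "U": 0},
    {"A": 0, "C": 2, "G": 2, "U": 0},
]
pwm = position_weight_matrix(counts, background)
print(round(pwms("AC", pwm), 4))
```

A motif whose nucleotides are enriched relative to the background at every site receives a positive score, whereas nucleotides that are rare at a site contribute negative terms.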



Results 



The Gibbs sampler implemented in DAMBE was employed to identify functional motifs in
E. coli NC101 contig 1. Figure 1 shows the shared motif in aligned format (in red), which is
CUGGAA in most sequences (Fig. 1b). The main Gibbs sampler outputs are the sequences with
aligned motifs, shown in Figure 1b, and a position weight matrix, presented in Table 1d.
Table 1a shows the total number of nucleotides in the sequences. The total number of
nucleotides for the 34 sequences is 31509, with 7622, 7977, 8879 and 7031 for A, C, G and U,
respectively. The partial output (Table 1b and c) showed that the 6-mer motif is CUGGAA.
The site-specific frequencies and the PWM are shown in Table 1c and d; they can be used to
scan other sequences for the presence of such motifs. The last part of the results (Table 2)
shows the start position of the motif in each sequence and again shows that the 6-mer motif
is CUGGAA in most sequences. Figure 2 shows the scatter diagram of S1D and S2D against the
length of the E. coli NC101 contig 1 sequences.
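The general idea of a Gibbs site sampler can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions (one motif occurrence per sequence, a fixed motif width, a fixed number of sweeps, an arbitrary pseudocount); it is not the DAMBE implementation, and the function name and toy sequences are hypothetical.

```python
import math
import random

ALPHABET = "ACGU"


def gibbs_motif_search(seqs, width, iterations=200, pseudocount=0.5, seed=0):
    """Return one sampled motif start (zero-based) per sequence."""
    rng = random.Random(seed)
    # Start from random motif positions.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    # Smoothed background nucleotide frequencies over all sequences.
    total = sum(len(s) for s in seqs)
    background = {
        n: (sum(s.count(n) for s in seqs) + 1) / (total + 4) for n in ALPHABET
    }
    for _ in range(iterations):
        for held in range(len(seqs)):
            # Site-specific counts from every sequence except the held-out one.
            counts = [{n: pseudocount for n in ALPHABET} for _ in range(width)]
            for k, (s, st) in enumerate(zip(seqs, starts)):
                if k != held:
                    for j in range(width):
                        counts[j][s[st + j]] += 1
            denom = (len(seqs) - 1) + 4 * pseudocount
            # Odds ratio (motif model vs. background) of every candidate window.
            seq = seqs[held]
            weights = []
            for st in range(len(seq) - width + 1):
                log_odds = sum(
                    math.log(counts[j][seq[st + j]] / denom / background[seq[st + j]])
                    for j in range(width)
                )
                weights.append(math.exp(log_odds))
            # Sample the new start in proportion to the odds ratios.
            starts[held] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts


# Toy sequences (not from the study), each containing CUGGAA once.
toy = ["AAACUGGAAAA", "UUCUGGAAUUU", "GGGGCUGGAAG", "ACUGGAAACCC"]
found = gibbs_motif_search(toy, width=6)
print([s[st:st + 6] for s, st in zip(toy, found)])
```

Because the start positions are sampled rather than chosen greedily, the result depends on the random seed and the number of sweeps; in practice several runs are usually compared.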



Discussion 



Finding motifs in DNA sequences is a central problem in computational molecular biology.
Among the many computational methods, the Gibbs sampling algorithm shows great promise for
finding functional motifs in co-expressed genes (13). Motif finding is becoming an
important tool for microbiologists, like other DNA and protein computational molecular
(a)

S1   ATGAACCACTCCTTAAAACCCTGGAACACATTTGGCATTGATCATAATGCTCAGC...
S2   ATGAAGGATAACACCGTGCCACTGAAATTGATCGCCCTGTTAGCGAACGGTGAAT...
S3   ATGAGTATAAAAGAGCAAACGTTAATGACGCCTTACCTACAGTTTGACCGCAACC...
S4   GCCGACTTAGCTCAGTAGGTAGAGCAACTGACTTGTAATCAGTAGGTCACCAGTT...
S5   GGTGGGGTTCCCGAGCGGCCAAAGGGAGCAGACTGTAAATCTGCCGTCACAGACT...
S6   GTATAATGGCTATTACCTCAGCCTTCCAAGCTGATGATGCGGGTTCGATTCCCGC...
S7   GCTGATATAGCTCAGTTGGTAGAGCGCACCCTTGGTAAGGGTGAGGTCGGCAGTT...
S8   ATGTCTAAAGAAAAGTTTGAACGTACAAAACCGCACGTTAACGTCGGTACTATCG...
S9   ATGAGTGCGAATACCGAAGCTCAAGGAAGCGGGCGCGGCCTGGAAGCGATGAAGT...
S10  ATGTCTGAAGCTCCTAAAAAGCGCTGGTACGTCGTTCAGGCGTTTTCCGGTTTTG...
S11  ATGGCTAAGAAAGTACAAGCCTATGTCAAGCTGCAGGTTGCAGCTGGTATGGCTA...
S12  ATGGCTAAACTGACCAAGCGCATGCGTGTTATCCGCGAGAAAGTTGATGCAACCA...
S13  ATGGCTTTAAATCTTCAAGACAAACAAGCGATTGTTGCTGAAGTCAGCGAAGTAG...
S14  ATGTCTATCACTAAAGATCAAATCATTGAAGCAGTTGCAGCTATGTCTGTAATGG...
S15  ATGGTTTACTCCTATACCGAGAAAAAACGTATTCGTAAGGATTTTGGTAAACGTC...
S16  GTGAAAGATTTATTAAAGTTTCTGAAAGCGCAGACTAAAACCGAAGAGTTTGATG...
S17  ATGAAAACCTTCAGCGATCGCTGGCGACAACTGGACTGGGATGACATCCACCTGC...
S18  ATGTTACGTATTGCGGACAAAACGTTTGCTTCACATCTGTTTACTGGCACCGGAA...
S19  ATGCAGATCCTGTTTAACGATCAACCGATGCAGTGTGTCGCCGGACTAACTGTTC...
S20  ATGAATGACCGTGACTTTATGCGTTATAGCCGCCAAATCCTGCTCGACGATATCG...
S21  ATGTATCAGCCAGATTTTCCTCCTGTACCCTTTCGTTTAGGACTGTACCCGGTGG...
S22  ATGTCTGTAACAAAACTGACCCGCCGCGAACAACGCGCCCAGGCCCAACATTTTA...
S23  ATGCTTAACCAGCTCGATAACCTGACGGAACGCGTCAGAGGAAGTAACAAACTGG...
S24  ATGGATCGTATAATTGAAAAATTAGATCACGGCTGGTGGGTCGTCAGCCATGAAC...
S25  ATGACCGAACTTAAAAACGATCGTTATCTGCGGGCGCTGCTGCGCCAGCCCGTTG...
S26  ATGGATCTCGCGTCATTACGCGCTCAACAAATTGAACTGGCTTCTTCTGTGATCC...
S27  ATGTTACAAAACCCAATTCATCTGCGTCTGGAGCGCCTAGAAAGCTGGCAGCACG...
S28  ATGAACAAGACTCAACTGATTGATGTAATTGCAGAGAAAGCAGAACTGTCCAAAA...
S29  ATGCTGGCGGGCGCTCTGTTTCTTACTGCCTGTAGTCACAACTCTTCACTTCCTC...
S30  ATGAAACGGAACACGAAAATTGCCCTGGTAATGATGGCGCTTTCAGCAATGGCGA...
S31  ATGCGTTTTATGCAACGTTCTAAAGACTCCTTAGCTAAATGGTTAAGCGCGATCC...
S32  ATGACGCACGATAATATCGATATTCTGGTGGTGGATGATGACATTAGCCACTGCA...
S33  ATGAAAGTATTAGTGATTGGTAACGGCGGGCGCGAGCACGCGCTGGCCTGGAAAG...
S34  ATGCAACAACGTCGTCCAGTCCGCCGCGCTCTGCTCAGTGTTTCTGACAAAGCCG...

↓ Gibbs sampler

(b)

S1     AUGAACCACUCCUUAAAACCCUGGAACACAUUUGGCAUUGAUCAUAAUG...
S2   AUAUUUGUCGAUGUUCUGGCGUCUGGAACAGGGCCCGGCGGCGGCGAUUGGUUUAAGUCU...
S3   GAAUCUGCGCCGUCAGGCAGUUCUGGAACAGUUUCUUGGUACCAACGGGCAACGCAUUCC...
...
S34                  UGCUCGCUGGAAGAUGCGGUAGAGAACAUCGAUAUCGGCGGC...

Fig. 1. The sequences of E. coli NC101 contig 1. The upper panel (a) shows the input data for the Gibbs sampler. The lower panel (b)
shows the output, with the motifs (i.e., CUGGAA; in red) aligned across the sequences. S1-S34 correspond to sequence 1 to sequence 34.



Int J Mol Cell Med Autumn 2013; Vol 2 No 4 179 








(a) Global alignment score (F) = 230.9101

Frequency table

Code   Count   Freq
A      7622    0.2419
C      7977    0.2532
G      8879    0.2818
U      7031    0.2231

(b) Final site-specific counts

Site   A     C     G     U
1      0     34    0     0
2      0     0     0     34
3      0     0     33    1
4      7     0     26    1
5      29    0     0     5
6      32    0     2     0

(c) Final site-specific frequencies

Site   A         C         G         U
1      0.00691   0.97866   0.00805   0.00638
2      0.00691   0.00723   0.00805   0.97780
3      0.00691   0.00723   0.95091   0.03495
4      0.20691   0.00723   0.75091   0.03495
5      0.83548   0.00723   0.00805   0.14923
6      0.92120   0.00723   0.06519   0.00638

(d) Final PWM

Site   A          C          G          U
1      -3.55288   1.34992    -3.55495   -3.55600
2      -3.55288   -3.55757   -3.55495   1.47685
3      -3.55288   -3.55757   1.21665    -1.85463
4      -0.15376   -3.55757   0.98051    -1.85463
5      1.24196    -3.55757   -3.55495   -0.40295
6      1.33962    -3.55757   -1.46340   -3.55600

Table 1. Number of input sequences: 34; Width of motif: 6. a: Frequency table; b: Final site-specific counts in motifs;
c: Final site-specific frequencies in motifs; d: Final PWM in motifs. A = adenine; C = cytosine; G = guanine;
U = uracil
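As a worked check, the following Python sketch scores a few motifs with the Table 1d matrix. It rests on an observation from the numbers rather than on a statement in the text: the tabulated entries reproduce the PWMS values of Table 2 when they are summed over the six sites and exponentiated with base e, i.e. when they are treated as natural-log odds and the PWMS as an odds ratio. The function name is hypothetical.

```python
import math

# Final PWM from Table 1d (one dict per site, sites 1..6).
PWM = [
    {"A": -3.55288, "C": 1.34992, "G": -3.55495, "U": -3.55600},
    {"A": -3.55288, "C": -3.55757, "G": -3.55495, "U": 1.47685},
    {"A": -3.55288, "C": -3.55757, "G": 1.21665, "U": -1.85463},
    {"A": -0.15376, "C": -3.55757, "G": 0.98051, "U": -1.85463},
    {"A": 1.24196, "C": -3.55757, "G": -3.55495, "U": -0.40295},
    {"A": 1.33962, "C": -3.55757, "G": -1.46340, "U": -3.55600},
]


def motif_odds(motif):
    """Sum the site-specific entries and convert the sum to an odds ratio.

    Assumption: the entries are natural-log odds, as inferred from the
    agreement with the PWMS column of Table 2.
    """
    log_odds = sum(PWM[j][nuc] for j, nuc in enumerate(motif))
    return math.exp(log_odds)


for motif in ("CUGGAA", "CUGAAA", "CUGGUA"):
    print(motif, round(motif_odds(motif), 2))
```

Under this reading, CUGGAA scores close to 2009.2, CUGAAA close to 646.3 and CUGGUA close to 387.8, in line with Table 2 up to rounding of the matrix entries.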
Fig. 2. Scatter diagram of S1D and S2D in E. coli NC101 contig 1 sequences.
Table 2. Gibbs sampler results for E. coli NC101 contig 1 sequences: motif, start location and PWMS

SeqName                          Motif    Start   PWMS
lcl|NZ_AEFA01000001.1_gene_1     CUGGAA   20      2009.2156
lcl|NZ_AEFA01000001.1_gene_2     CUGGAA   414     2009.2156
lcl|NZ_AEFA01000001.1_gene_3     CUGGAA   225     2009.2156
lcl|NZ_AEFA01000001.1_gene_4     CUUGUA   31      17.9811
lcl|NZ_AEFA01000001.1_gene_5     CUGUAA   32      117.9619
lcl|NZ_AEFA01000001.1_gene_6     CUGAUG   39      7.5632
lcl|NZ_AEFA01000001.1_gene_7     CUGAUA   1       124.7508
lcl|NZ_AEFA01000001.1_gene_8     CUGGAA   438     2009.2156
lcl|NZ_AEFA01000001.1_gene_9     CUGGAA   39      2009.2156
lcl|NZ_AEFA01000001.1_gene_10    CUGAAA   471     646.2746
lcl|NZ_AEFA01000001.1_gene_11    CUGAAA   237     646.2746
lcl|NZ_AEFA01000001.1_gene_12    CUGGAA   564     2009.2159
lcl|NZ_AEFA01000001.1_gene_13    CUGAAA   213     646.2746
lcl|NZ_AEFA01000001.1_gene_14    CUGGAA   330     2009.2156
lcl|NZ_AEFA01000001.1_gene_15    CUGGAA   144     2009.2156
lcl|NZ_AEFA01000001.1_gene_16    CUGGAA   504     2009.2156
lcl|NZ_AEFA01000001.1_gene_17    CUGGAA   159     2009.2159
lcl|NZ_AEFA01000001.1_gene_18    CUGGAA   417     2009.2156
lcl|NZ_AEFA01000001.1_gene_19    CUGGAG   63      121.8117
lcl|NZ_AEFA01000001.1_gene_20    CUGGAA   606     2009.2156
lcl|NZ_AEFA01000001.1_gene_21    CUGGAA   435     2009.2156
lcl|NZ_AEFA01000001.1_gene_22    CUGGAA   63      2009.2156
lcl|NZ_AEFA01000001.1_gene_23    CUGGAA   237     2009.2156
lcl|NZ_AEFA01000001.1_gene_24    CUGGUA   671     387.8401
lcl|NZ_AEFA01000001.1_gene_25    CUGGAA   765     2009.2156
lcl|NZ_AEFA01000001.1_gene_26    CUGAAA   153     646.2746
lcl|NZ_AEFA01000001.1_gene_27    CUGGAA   387     2009.2156
lcl|NZ_AEFA01000001.1_gene_28    CUGAAA   105     646.2746
lcl|NZ_AEFA01000001.1_gene_29    CUGGAA   552     2009.2156
lcl|NZ_AEFA01000001.1_gene_30    CUGGUA   24      387.8401
lcl|NZ_AEFA01000001.1_gene_31    CUGGAA   147     2009.2156
lcl|NZ_AEFA01000001.1_gene_32    CUGGAA   348     2009.2156
lcl|NZ_AEFA01000001.1_gene_33    CUGGAA   47      2009.2156
lcl|NZ_AEFA01000001.1_gene_34    CUGGAA   354     2009.2156

Mean                                              1429.4078
Standard deviation                                812.3610



biology sequence analysis methods. These techniques can provide valuable information at
much lower cost than laboratory experiments. The most common application of motif finding
is to identify transcription factor binding sites (TFBS) (14).



Transcription factors (TFs) most often attach to short segments of DNA (binding sites)
upstream of a gene to activate or repress its transcription. Their DNA-binding domains
recognize specific motifs. Transcription regulatory protein factors (TRPFs) often bind to
DNA as homo- or heterodimers. They therefore recognize DNA motifs that are spaced motif
pairs, inverted repeats or direct repeats. However, these motifs are often difficult to
identify owing to their high divergence (15). Because of multiple binding modes and
indirect recognition, the interaction between operators and holorepressors is highly
sensitive to the experimental conditions (DNA length, buffer components, etc.). Thus, there
are many discrepancies in the equilibrium constants, kinetic data and stoichiometry
reported for these complexes. In other words, the equilibrium binding constant for a
holorepressor/operator pair differs from one experiment to another (15-16). Moreover, there
is no unique combination of bases that is shared by all binding sites; although different
bases can occur at each position, there are clear biases in the distribution of bases at
each position of the binding sites (17-18). The PWM has been employed in genome
investigations such as whole-genome identification of transcription units (19),
transcription factor binding sites (TFBS) (20), transcription initiation sites (21) and
translation initiation sites (22). Position weight matrix sequence analysis has three
outputs: the site-specific frequencies, the position weight matrix, and the PWMS. It is
also of interest to find sequence motifs in a set of genes found to be co-expressed in
microarray experiments (23). If these genes are co-regulated, they share TFBS that could be
monitored or controlled by similar or common TFs (24). The Gibbs sampler outputs a
quantitative measure of the motifs. The PWMS is the log-odds ratio, and the strongest motif
has the highest PWMS or odds ratio (5). In this work, the Gibbs sampler algorithm was
employed to find the functional motifs in E. coli NC101 contig 1 sequences. The results
showed that CUGGAA is a 6-mer motif with the highest PWMS, 2009.2156. Of the 34 gene
sequences, 24 had the 6-mer motif CUGGAA (70.58%) (Table 2). Homology between genes based
on sequence motifs is crucial for understanding the function of uncharacterized genes and
may be helpful in studying the dynamic behavior of genes (25). That is, genes sharing the
motif may be co-regulated. Recently, researchers reported that the transfer of a functional
gene from bacteria to mammalian cells could occur. They showed that engineered E. coli
expressing the Inv and HlyA genes (from Yersinia pseudotuberculosis and Listeria
monocytogenes, respectively) are able to invade mammalian cells and release DNA into them
(26). A similar phenomenon was also reported in vivo: invasive E. coli can carry and
deliver therapeutic genes to the colonic mucosa in mice (27). In addition, shRNA was
successfully transferred into mammalian cells by non-pathogenic E. coli through a plasmid
(28). Bacterial strains, for example E. coli, Salmonella and Clostridium, can selectively
grow in and colonize tumors. In fact, it has been shown that bacteria are able to target
primary tumors and metastases and can be used for tumor-selective drug delivery (29).

Our results may support this scenario by identifying functional genetic switches and motifs
in E. coli NC101 contig 1. The branch point sequence could be located anywhere; however, it
is preferentially located near the 3' site rather than the 5' site. Experiments that mutate
each nucleotide of the sequence between the donor and the acceptor site step by step could
certainly be performed, but this is tedious and difficult. Therefore, one can run the Gibbs
sampler to find all the BPSs. The BPS cuts each E. coli NC101 contig 1 sequence into two
sections: the upstream part stretching from the 5' site to the BPS (the S1 sequence), and
the downstream part from the BPS to the 3' site (the S2 sequence). The lengths of the S1
and S2 sequences are termed the S1 and S2 distances (S1D and S2D). If the BPS is
constrained to be near the 3' site, the S2 distance is smaller than the S1 distance, and
vice versa (5).

The scatter diagram of S1D and S2D of the E. coli NC101 contig 1 sequences is shown in
Figure 2. The results showed that most S2D values were higher than the S1D values
(650.11 ± 157.24 and 270.61 ± 37.17, respectively).
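The S1 and S2 distances defined above can be computed from a motif start position and a sequence length, as in the following minimal Python sketch. It assumes that the Start column of Table 2 is a zero-based offset (consistent with the sequences in Fig. 1a, where CUGGAA in S1 is preceded by 20 nucleotides) and that the motif itself is excluded from both distances; the function name and the sequence lengths in the example are hypothetical.

```python
from statistics import mean, stdev


def s1d_s2d(seq_length, start, width=6):
    """Return (S1D, S2D) for a motif of the given width.

    start is taken as a zero-based offset, so it equals the number of
    nucleotides upstream of the motif (assumption, see lead-in).
    """
    if not 0 <= start <= seq_length - width:
        raise ValueError("motif does not fit inside the sequence")
    s1d = start                           # 5' end up to the motif
    s2d = seq_length - (start + width)    # motif end up to the 3' end
    return s1d, s2d


# Start offsets of genes 1, 4 and 7 from Table 2; the lengths are hypothetical.
examples = [(900, 20), (76, 31), (77, 1)]
pairs = [s1d_s2d(length, start) for length, start in examples]
s1d_values = [p[0] for p in pairs]
s2d_values = [p[1] for p in pairs]
print("S1D:", mean(s1d_values), stdev(s1d_values))
print("S2D:", mean(s2d_values), stdev(s2d_values))
```

With the real sequence lengths in place of the hypothetical ones, the same summary statistics could be compared with the values reported for Figure 2.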

The present study may help experimental studies elucidate the pharmacological and
phylogenetic functions of the motifs in E. coli.